
MXM (version 0.9.7)

Data simulation from a DAG (directed acyclic graph)

Description

Simulation of continuous data from a DAG (directed acyclic graph). The data are drawn from a multivariate normal distribution whose covariance structure is determined by a (possibly randomly generated) adjacency matrix. A percentage of outliers can optionally be added to the data.

Usage

rdag(n, p, s, a = 0, m, A = NULL, seed = FALSE)

Arguments

n
A number indicating the sample size.
p
A number indicating the number of nodes (or vertices, or variables).
s
A number in $(0, 1)$. This controls the sparseness of the model; it is the probability that an edge between two nodes is present.
a
A number in $(0, 1)$. This defines the percentage of outliers to be included in the simulated data. If $a=0$, no outliers are generated.
m
A vector whose length is equal to the number of nodes. This is the mean vector of the normal distribution from which the data are to be generated. It is used only when $a>0$, in order to define the mean vector of the multivariate normal from which the outliers will be generated.
A
If you already have an adjacency matrix in mind, supply it here; otherwise, leave it NULL.
seed
If seed is TRUE, the simulated data will always be the same.
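
A minimal usage sketch follows; the specific numbers are illustrative only and are not package defaults.

library(MXM)

## 100 observations from a random DAG with 20 nodes and edge probability 0.2
y <- rdag(n = 100, p = 20, s = 0.2)

## the same, but with 5% outliers whose mean is shifted in the last variable only
y_out <- rdag(n = 100, p = 20, s = 0.2, a = 0.05, m = c(rep(0, 19), 5))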

Value

A list including, among other components, the simulated data x and the adjacency matrix G of the DAG (see the Examples below).

Details

In the case where no adjacency matrix is given, a $p \times p$ matrix with zeros everywhere is created. Every element below the diagonal is replaced by a random value from a Bernoulli distribution with probability of success equal to s. This is the matrix B. Every value of 1 is then replaced by a uniform value in $(0.1, 1)$. This final matrix is called A. The data are generated from a multivariate normal distribution with a zero mean vector and covariance matrix equal to $\left({\bf I}_p - A\right)^{-1}\left({\bf I}_p - A\right)^{-T}$, where ${\bf I}_p$ is the $p \times p$ identity matrix.

If a is greater than zero, the outliers are generated from a multivariate normal with the same covariance matrix and with mean vector equal to the one specified by the user via the argument "m". This gives flexibility in that you can specify outliers in some variables only or in all of them. For example, m = c(0, 0, 5) introduces outliers in the third variable only, whereas m = c(5, 5, 5) introduces outliers in all variables. The user is free to decide on the type of outliers to include in the data.
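
The following base-R sketch illustrates the construction described above for the default case (no user-supplied adjacency matrix and no outliers). It is a sketch of the mechanism only, not the actual rdag() implementation; MASS::mvrnorm() is used here to draw the multivariate normal sample.

n <- 100; p <- 5; s <- 0.2

B <- matrix(0, p, p)
B[lower.tri(B)] <- rbinom(p * (p - 1) / 2, 1, s)      # Bernoulli(s) entries below the diagonal
A <- B
A[A == 1] <- runif(sum(A == 1), 0.1, 1)               # replace the 1s with Uniform(0.1, 1) weights

Ip <- diag(p)
Sigma <- solve(Ip - A) %*% t(solve(Ip - A))            # (I_p - A)^{-1} (I_p - A)^{-T}
x <- MASS::mvrnorm(n, mu = rep(0, p), Sigma = Sigma)   # simulated data with zero mean vector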

References

Colombo, Diego, and Marloes H. Maathuis (2014). Order-independent constraint-based causal structure learning. The Journal of Machine Learning Research, 15(1): 3741--3782.

See Also

pc.skel, pc.or, mmhc.skel

Examples

library(MXM)

y <- rdag(100, 20, 0.2)   # simulate 100 observations from a random DAG with 20 nodes
x <- y$x                  # the simulated data
tru <- y$G                # the adjacency matrix of the true DAG

mod <- pc.con(x)          # skeleton of the PC algorithm on the simulated data
b <- pc.or(mod)           # orientation step of the PC algorithm
plotnetwork(tru)          # plot the true network
dev.new()
plotnetwork(b$G)          # plot the estimated network in a new device
